Collaborative Web Crawling: Information Gathering/Processing over Internet

Authors

  • Shang-Hua Teng
  • Qi Lu
  • Matthias Eichstaedt
  • Daniel Alexander Ford
  • Tobin J. Lehman
Abstract

The main objective of the IBM Grand Central Station (GCS) is to gather information in virtually any format (text, data, image, graphics, audio, video) from cyberspace, to process/index/summarize that information, and to push the right information to the right people. Because of the very large scale of cyberspace, parallel processing in both crawling/gathering and information processing is indispensable. In this paper, we present a scalable method for collaborative web crawling and information processing. The method includes an automatic cyberspace partitioner designed to dynamically balance and re-balance the load among processors. It can be used when all web crawlers are located on a tightly coupled high-performance system as well as when they are scattered in a distributed environment. We have implemented our algorithms in Java.
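The abstract does not reproduce the partitioner itself; the following is a minimal Java sketch, under assumed design choices, of how a cyberspace partitioner of this kind could assign URLs to crawlers and re-balance load. Hashing by host name, the round-robin initial assignment, and the one-partition-at-a-time migration are all illustrative assumptions, not the GCS algorithm.

    import java.util.*;

    // Hypothetical sketch of a hash-based cyberspace partitioner; the GCS
    // paper's actual algorithm may differ. URLs map to partitions by host
    // name, and partitions migrate between crawlers to re-balance load.
    public class CyberspacePartitioner {
        private final int numPartitions;
        private final Map<Integer, Integer> partitionToCrawler = new HashMap<>();

        public CyberspacePartitioner(int numPartitions, int numCrawlers) {
            this.numPartitions = numPartitions;
            for (int p = 0; p < numPartitions; p++) {
                partitionToCrawler.put(p, p % numCrawlers); // initial round-robin assignment
            }
        }

        // All URLs on one host fall into one partition, so per-host state
        // (politeness delays, robots.txt) stays local to a single crawler.
        public int crawlerFor(String url) {
            String host = java.net.URI.create(url).getHost();
            int partition = Math.floorMod(host.hashCode(), numPartitions);
            return partitionToCrawler.get(partition);
        }

        // Re-balance by moving one partition from the most loaded crawler to
        // the least loaded one; loadByCrawler maps crawler id -> queued URLs.
        public void rebalance(Map<Integer, Integer> loadByCrawler) {
            int busiest = Collections.max(loadByCrawler.entrySet(), Map.Entry.comparingByValue()).getKey();
            int idlest = Collections.min(loadByCrawler.entrySet(), Map.Entry.comparingByValue()).getKey();
            for (Map.Entry<Integer, Integer> e : partitionToCrawler.entrySet()) {
                if (e.getValue() == busiest) { e.setValue(idlest); break; } // migrate one partition
            }
        }
    }

Because the host-to-partition mapping is stable, a migrated partition carries all of its hosts with it, so no URL is ever owned by two crawlers at once.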


Similar resources

A Study of Mobile Agent Platforms for Distributed Web Crawling

In traditional centralized crawling techniques, pages from all over the web are brought to the search engine side, which results in a lot of unnecessary Internet traffic. In distributed crawling with migrating agents, the mobile agent is sent to the web server side and brings back only the required pages, which reduces this unnecessary overhead. The mobile (migrating) agents ...
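The abstract does not name a specific agent platform; a hypothetical Java sketch of the filter-at-source idea it describes might look like the following, where selectRelevant runs on the web server side and only matching pages cross the network. The keyword-matching relevance test is an assumption for illustration.

    import java.util.*;

    // Hypothetical sketch of the migrating-agent idea: the agent executes
    // next to the web server, inspects pages locally, and ships back only
    // the pages the search engine actually needs.
    public class FilteringAgent {
        private final Set<String> queryTerms;

        public FilteringAgent(Set<String> queryTerms) {
            this.queryTerms = queryTerms;
        }

        // Runs on the server side; pages maps URL -> locally readable content.
        public List<String> selectRelevant(Map<String, String> pages) {
            List<String> shipped = new ArrayList<>();
            for (Map.Entry<String, String> page : pages.entrySet()) {
                String body = page.getValue().toLowerCase(Locale.ROOT);
                for (String term : queryTerms) {
                    if (body.contains(term)) { // only matching pages are sent back
                        shipped.add(page.getKey());
                        break;
                    }
                }
            }
            return shipped;
        }
    }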


Collecte orientée sur le Web pour la recherche d'information spécialisée. (Focused document gathering on the Web for domain-specific information retrieval)

Vertical search engines, which focus on a specific segment of the Web, are becoming more and more present in the Internet landscape. Topical search engines, notably, can obtain a significant performance boost by limiting their index to a specific topic. By doing so, language ambiguities are reduced, and both the algorithm...


Automatic relevant Source Discovery over the Internet based on user profile

The enormous growth of the Web in recent years has made it difficult to discover new sources of interest on a given topic, even when starting from an existing set of relevant sources. To address this problem, we introduce an approach that provides users with new relevant sources of information by exploiting their needs. It aims at combining a personalized crawler with a collaborative filtering system. We...
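The abstract only names the combination; a minimal sketch, assuming a simple linear blend, of how a personalized signal and a collaborative-filtering signal could be mixed when ranking a candidate source (the weight alpha and both input signals are hypothetical, not from the paper):

    import java.util.List;

    // Hypothetical blend of a content-based signal (similarity between the
    // source and the user's profile) and a collaborative signal (ratings
    // of the same source by similar users).
    public class SourceRecommender {
        public double score(double profileSimilarity, List<Double> neighborRatings, double alpha) {
            double collaborative = neighborRatings.stream()
                    .mapToDouble(Double::doubleValue)
                    .average()
                    .orElse(0.0); // default when there are no neighbor ratings
            return alpha * profileSimilarity + (1 - alpha) * collaborative;
        }
    }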


RIDIRE-CPI: an Open Source Crawling and Processing Infrastructure for Web Corpora Building

This paper introduces RIDIRE-CPI, an open source tool for building web corpora with a specific design through a targeted crawling strategy. The tool has been developed within the RIDIRE Project, which aims at creating a 2-billion-word balanced web corpus for Italian. The RIDIRE-CPI architecture integrates existing open source tools as well as modules developed specifically within the RID...


On Learning Strategies for Topic Specific Web Crawling

Crawling has been a topic of considerable interest in recent years because of the rapid growth of the World Wide Web. In many cases, it is possible to design more effective crawlers that find web pages belonging to specific topics. In this paper, we discuss some recent techniques for crawling web pages belonging to specific topics. We discuss the following classes of techniques: (1) I...
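The survey's list of technique classes is truncated above; as one concrete illustration (an assumed design, not taken from the paper), a topic-specific crawler typically keeps its frontier ordered by a learned relevance score, so links predicted to be on-topic are fetched first:

    import java.util.*;

    // Minimal sketch of a focused-crawl frontier: candidates are prioritized
    // by a relevance score produced by some learned model (the model itself
    // is out of scope here).
    public class FocusedFrontier {
        // A candidate URL with its predicted on-topic score.
        public record Candidate(String url, double score) {}

        private final PriorityQueue<Candidate> queue =
                new PriorityQueue<>(Comparator.comparingDouble(Candidate::score).reversed());
        private final Set<String> seen = new HashSet<>();

        public void offer(String url, double score) {
            if (seen.add(url)) {      // de-duplicate URLs before queueing
                queue.add(new Candidate(url, score));
            }
        }

        public Candidate next() {
            return queue.poll();      // highest-scoring candidate first
        }
    }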




Journal title:

Volume   Issue

Pages  -

Publication date: 1999